Lexicon Optimization for Chinese Language Modeling

نویسنده

  • ZHAO Jun
چکیده

In this paper, we present an approach to lexicon optimization for Chinese language modeling. The method is an iterative procedure consisting of two phases, namely lexicon generation and lexicon pruning. In the first phase, we extract appropriate new words from a very large training corpus using statistical approaches. In the second phase, we prune the lexicon to a preset memory limitation using a perplexity minimization criterion. Experimental results show up to a 6% character perplexity reduction compared to the baseline lexicon.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A unified approach to statistical language modeling for Chinese

This paper presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigrams to Chinese is challenging because (1) there is no standard definition of words in Chinese, (2) word boundaries are not marked by spaces, and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a highquality training data set...

متن کامل

Improved Chinese broadcast news transcription by language modeling with temporally consistent training corpora and iterative phrase extraction

In this paper an iterative Chinese new phrase extraction method based on the intra-phrase association and context variation statistics is proposed. A Chinese language model enhancement framework including lexicon expansion is then developed. Extensive experiments for Chinese broadcast news transcription were then performed to explore the achievable improvements with respect to the degree of tem...

متن کامل

Language modeling of Chinese personal names based on character units for continuous Chinese speech recognition

In this paper, we analyze Chinese personal names to model their statistical phonotactic characteristics for continuous Chinese speech recognition. The analysis showed languagespecific characteristics of Chinese personal names and strongly suggested the advantage of character-unit oriented modeling. A hierarchical language model was composed by reflecting statistical phonotactic characteristics ...

متن کامل

Lexicon adaptation with reduced character error (LARCE) - a new direction in Chinese language modeling

Good language modeling relies on good predefined lexicons. For Chinese, since there are no text word boundaries and the concept of “word” is not very well defined, constructing good lexicons is difficult. In this paper, we propose lexicon adaptation with reduced character error (LARCE), which learns new word tokens based on the criterion of reduced adaptation corpus error rate. In this approach...

متن کامل

Unsupervised Learning of a Chinese Spontaneous and Colloquial Speech Lexicon with Content and Filler Phrase Classification

There is significant lexical difference—words and usage of words-between spontaneous/colloquial language and the written language. This difference affects the performance of spoken language recognition systems that use statistical language models or context-free-grammars because these models are based on the written language rather than the spoken form. There are many filler phrases and colloqu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2000